Research workflow with confidential data:
The ‘Expert’ BPLIM researcher Workflow

2023-12-18

Access to BPLIM’s microdata

  1. https://bplim.bportugal.pt
  2. Guide for Researchers
  3. Application Form
  4. Confidentiality Agreement

BPLIM’s GitHub

  1. https://github.com/BPLIM
  2. The tools developed by the BPLIM team, statistical packages, and containers are available on GitHub, together with the documentation for each database.

Remote access to BPLIM’s computational facilities

NoMachine: Download and install

NoMachine

Remote server: NoMachine connection

Add and configure connection

Remote server: Desktop

Remote server: File manager

Dolphin

Remote server: Initial dataset

initial_dataset folder

Remote server: Tools

tools folder

Remote server: Stata

  • Profile and template do-file

Remote server: Additional packages


  • If additional packages are needed, BPLIM’s staff can install them on request (tools folder)

  • A more flexible approach is the use of containers – reproducibility and autonomy regarding packages and versions

Remote server: Containers, the concept


“A container is a lightweight, stand-alone, executable package of software that includes everything needed to run a piece of software, including the code, runtime, system tools, libraries, and settings. Containers are isolated from each other and the host system. This isolation allows for efficient, reliable, and consistent deployment of applications, regardless of the environment.” (ChatGPT, 2023)

Remote server: Stata running in a Container


How to build a container?


The concept of a definition file


  • “Text document that serves as a blueprint for creating a Singularity container image. This file, typically having a .def extension, contains specific instructions and settings for the container. It outlines the base environment, including the base OS, any required applications, libraries, and dependencies.” (ChatGPT, 2023)

  • A detailed manual on how to build and use containers is available at BPLIM’s GitHub:

https://github.com/BPLIM/Containers/tree/main/Manual

How to build a container?


Definition files are available at BPLIM’s GitHub: https://github.com/BPLIM/Containers


Remote server: Git is available

The concept

  • “Git is a distributed version control system, primarily used for source code management in software development. It allows multiple developers to work on the same project simultaneously without interfering with each other’s changes. Git tracks the progress of changes in a series of snapshots, enabling users to revert back to previous versions of their work if necessary. It’s known for its speed, data integrity, and support for distributed, non-linear workflows.” (ChatGPT, 2023)

  • A detailed manual on how to set up and use Git on the remote server is available at BPLIM’s GitHub:

https://github.com/BPLIM/Manuals/tree/master/ExternalServer/Git

Replication App

The BPLIM team developed a tool to streamline the replication of a research project.

  • Research project’s folder structure

Replication App

  • Using Dolphin, go to work_area and click on ReplicationApp.desktop

ReplicationApp icon

Replication App

  • Fill the boxes with the information from the project

Replication App

  • master.do file

Replication App

  • Fill the boxes with the information from the project: Container and definition file

Replication App

  • Fill the boxes with the information from the project: Dependencies

Replication App

  • Fill the boxes with the information from the project: Tools

Replication App

  • Fill the boxes with the information from the project: Run

Replication App

  • Fill the boxes with the information from the project: Replication output

Replication App: json file


In the work_area folder, the file structure.json holds the different sets of information entered in the app
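
The layout of structure.json is not shown here, so the following is only a minimal sketch for inspecting it: it loads the file and lists each top-level key with the type of its value. The relative path work_area/structure.json and the assumption that the top level is a JSON object are illustrative, not confirmed by BPLIM's documentation.

```python
import json
from pathlib import Path


def summarize_structure(json_path):
    """Return a mapping from each top-level key to the type name of its value."""
    with open(json_path, encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumption: the file stores a JSON object (key/value pairs) at the top level.
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in data.items()}


if __name__ == "__main__":
    # Hypothetical location; adjust to where work_area sits in your project.
    path = Path("work_area") / "structure.json"
    if path.exists():
        for key, kind in summarize_structure(path).items():
            print(f"{key}: {kind}")
    else:
        print(f"{path} not found")
```

Running this before editing anything by hand gives a quick overview of which sets of information the app has recorded.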

Replication App: Outcomes


  • Folder ados: ado files programmed by the researcher.

  • Folder code: contains the code used to replicate all the analysis performed by the researcher.

  • Folder results: outcomes of the statistical analysis. This is the folder that will be shared with the researcher after output control.
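
As a rough illustration of the three outcome folders listed above, the sketch below checks whether a project directory contains ados, code, and results. The function name and the idea of running such a check are assumptions made for illustration; the Replication App itself may organize or validate the folders differently.

```python
from pathlib import Path

# Folder names taken from the list of Replication App outcomes.
EXPECTED_FOLDERS = ("ados", "code", "results")


def missing_folders(project_root):
    """Return the expected outcome folders that are absent from project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Hypothetical project location; replace with your own project folder.
    absent = missing_folders(".")
    if absent:
        print("Missing folders:", ", ".join(absent))
    else:
        print("All outcome folders are present.")
```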

Appendix

How to build a container?


Using the container available in Sylabs

Library for Public Images: https://cloud.sylabs.io/library/reisportela/bplim/bplim_stata17_python310

How to build a container?


Build your container using Sylabs


  1. Go to Sylabs (https://cloud.sylabs.io/), sign up, and sign in

  2. Go to Remote Builder

  3. Copy/paste the definition file into the text box

  4. Give the container a name and click Submit Build

How to build a container?

Build your container using your local machine

  1. Use the following definition file as a template:

BPLIM_Stata17_Python310_from_Sylabs_V4.def

  2. To build the container you must have a valid Stata 17 license

  3. When building the container, the file Stata_ados_BASE.do is used to install the ado files you need

  4. If you need additional Linux packages in your container, add them in the %post section of the definition file. See further details at https://github.com/BPLIM/Containers/tree/main/Stata
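
To make the role of the %post section concrete, here is a minimal sketch that writes out the text of a small definition file with extra Linux packages. It does not reproduce BPLIM's template. It assumes the container is bootstrapped from the Sylabs library image referenced in this appendix and that the base image is Debian/Ubuntu-based, so that apt-get is available. Both points are assumptions to verify against the actual template, and the package name is only an example.

```python
def build_definition(base_image, apt_packages):
    """Return the text of a minimal definition file.

    base_image   -- library path placed after "From:" (assumed to exist on Sylabs)
    apt_packages -- extra Linux packages to install in %post (assumes apt-get)
    """
    lines = [
        "Bootstrap: library",
        f"From: {base_image}",
        "",
        "%post",
    ]
    if apt_packages:
        # Assumption: the base image uses apt-get; other distributions differ.
        lines.append("    apt-get update")
        lines.append("    apt-get install -y " + " ".join(apt_packages))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = build_definition(
        "reisportela/bplim/bplim_stata17_python310",
        ["htop"],  # illustrative package name only
    )
    print(text)
```

The printed text can be saved with a .def extension and compared with the template before building.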

Big data in Stata: parquet files


Parquet file support in Stata is provided by Mauricio Caceres and is available on the remote server


  1. Open a Terminal
  2. Launch the container using the command line
  3. See the example that opens a Stata file, saves it as parquet and reads the parquet file
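
Step 2 above can be sketched as follows. The helper only assembles the command line for opening an interactive shell in a container image with `singularity shell`. The image file name my_container.sif is hypothetical, and the runtime on the remote server may be invoked under a different name, so treat both as assumptions.

```python
import shlex


def shell_command(image_path, runtime="singularity"):
    """Build the argument list that opens an interactive shell in a container."""
    return [runtime, "shell", image_path]


if __name__ == "__main__":
    # Hypothetical image name; use the .sif file provided for your project.
    cmd = shell_command("my_container.sif")
    # Print the command so it can be pasted into the Terminal opened in step 1.
    print(shlex.join(cmd))
```

Keeping the command as a list of arguments avoids quoting problems when the image path contains spaces.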